Lab 4: Intro to Machine Learning

Practice session covering topics discussed in Lecture 4

M. Chiara Mimmi, Ph.D. | Università degli Studi di Pavia

July 27, 2024

GOAL OF TODAY’S PRACTICE SESSION

  • Review the basic questions we can ask about ASSOCIATION between any two variables:
    • does it exist?
    • how strong is it?
    • what is its direction?
  • Introduce a widely used analytical tool: REGRESSION



The examples and code from this lab session follow very closely …..:

Topics discussed in Lecture # 4

  • Shifting the emphasis to empirical prediction
    • Distinction between supervised & unsupervised algorithms
      • Unsupervised ML Example
        • PCA
  • Useful R resources for metabolomics
    • Introduction to MetaboAnalyst software
  • Elements of statistical power analysis

R ENVIRONMENT SET UP & DATA

Needed R Packages

  • We will use functions from packages base, utils, and stats (pre-installed and pre-loaded)
  • We will also use the packages below (specifying package::function for clarity).
# Load them for this R session

# General
library(fs)          # file/directory interactions
library(here)        # tools to find your project's files, based on the working directory
library(paint)       # paint data.frame summaries in colour
library(janitor)     # tools for examining and cleaning data (masks stats::chisq.test, stats::fisher.test)
library(dplyr)       # {tidyverse} tools for manipulating and summarizing tidy data (masks stats::filter, stats::lag)
library(forcats)     # {tidyverse} tool for handling factors
library(openxlsx)    # read, write and edit xlsx files
library(flextable)   # functions for tabular reporting
# Statistics
library(rstatix)     # pipe-friendly framework for basic statistical tests (masks stats::filter, janitor::make_clean_names)
library(lmtest)      # testing linear regression models (also loads zoo)
library(broom)       # convert statistical objects into tidy tibbles
# library(tidymodels) # not installed on this machine
library(performance) # assessment of regression models performance
# Plotting
library(ggplot2)     # create elegant data visualisations using the grammar of graphics

DATASETS for today

We will use examples (with adapted datasets) from real clinical studies, provided among the learning materials of the open access books:

Importing Dataset 1 (NHANES)

Name: NHANES (National Health and Nutrition Examination Survey) combines interviews and physical examinations to assess the health and nutritional status of adults and children in the United States. Started in the 1960s, it became a continuous program in 1999.
Documentation: dataset1
Sampling details: Here we use a sample of 500 adults from NHANES 2009-2010 & 2011-2012 (nhanes.samp.adult.500 in the R oibiostat package, which has been adjusted so that it can be viewed as a random sample of the US population)

  • Adapt the here() call to match your own folder structure

NHANES Variables and their description

[EXCERPT: see complete file in Input Data Folder]

MACHINE LEARNING: A FOCUS ON PREDICTION

Splitting the dataset into training and testing samples

Julia Silge https://supervised-ml-course.netlify.app/
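
A minimal sketch of such a split in base R. It assumes an 80/20 proportion and uses the built-in mtcars data as a stand-in, since the NHANES sample is not loaded at this point; the object names (dat, train, test) are illustrative only.

```r
# Stand-in data: built-in mtcars (assumption; replace with the NHANES sample)
dat <- mtcars

set.seed(123)                                  # make the random split reproducible
n_train   <- floor(0.8 * nrow(dat))            # 80% of rows for training (assumed proportion)
train_idx <- sample(seq_len(nrow(dat)), size = n_train)

train <- dat[train_idx, ]                      # used to fit the model
test  <- dat[-train_idx, ]                     # held out to evaluate predictions

# Fit on the training rows only, then predict the unseen test rows
fit  <- lm(mpg ~ wt, data = train)
pred <- predict(fit, newdata = test)

# Root mean squared error on the test set
rmse <- sqrt(mean((test$mpg - pred)^2))
rmse
```

The point of the split is that rmse is computed on rows the model never saw, so it estimates predictive performance rather than goodness of fit.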

_______

ML WITH SUPERVISED ALGORITHMS

PCA: step by step (example)

  1. PCA done by hand: a step-by-step PCA as in the Statology tutorial, but using the lecture dataset nmr_bins…csv

https://www.statology.org/principal-components-analysis-in-r/

The results will probably not match exactly, because MetaboAnalyst (MA) applies both normalization and scaling whereas the Statology tutorial only scales. That is fine: it lets us see the difference.
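
A possible sketch of the "by hand" steps, using only base R. It runs on the built-in USArrests data (the dataset used in the Statology tutorial) as a stand-in for the lecture file, and it reproduces only the scaling step, not MetaboAnalyst's normalization. Note that PCA uses no class labels, so it is an unsupervised technique even though it is often run before PLS-DA.

```r
# Stand-in data: built-in USArrests (assumption; the lecture uses nmr_bins…csv)
X <- USArrests

# Step 1: centre and scale each variable (mean 0, sd 1)
X_scaled <- scale(X)

# Step 2: eigen-decomposition of the covariance matrix of the scaled data
eig <- eigen(cov(X_scaled))

# Step 3: scores = scaled data projected onto the eigenvectors (loadings)
scores_manual <- X_scaled %*% eig$vectors

# Step 4: proportion of variance explained by each component
var_explained <- eig$values / sum(eig$values)
round(var_explained, 3)

# Cross-check against stats::prcomp (component signs are arbitrary, so compare absolute values)
pca <- prcomp(X, center = TRUE, scale. = TRUE)
all.equal(abs(unname(scores_manual)), abs(unname(pca$x)))
```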

PLS-DA: step by step (example)

  1. PCA + PLS-DA + clustering: https://rpubs.com/Anita_0736/PD_ANALYSIS

  2. PLS done by hand: a step-by-step PLS as in the Statology tutorial, but using the lecture dataset nmr_bins…csv

https://www.statology.org/partial-least-squares-in-r/

MetaboAnalyst uses PLS-DA rather than plain PLS regression. How the two differ still needs checking, but it could be instructive to see the difference.
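
On the difference between the two: PLS-DA is PLS regression applied to a categorical response that has been coded as dummy (0/1) variables, followed by a rule that turns the fitted values into class labels. The sketch below extracts a single PLS component by hand in base R. It uses two iris species as a stand-in for the lecture data, and the 0.5 cut-off is an assumed decision rule, not MetaboAnalyst's procedure.

```r
# Stand-in data: two iris species (assumption; the lecture uses nmr_bins…csv)
dat <- iris[iris$Species != "setosa", ]
dat$Species <- droplevels(dat$Species)

X   <- scale(as.matrix(dat[, 1:4]))              # centred, unit-variance predictors
y   <- as.numeric(dat$Species == "virginica")    # class coded as a 0/1 dummy
y_c <- y - mean(y)                               # centred response

# One PLS component (single response):
w <- crossprod(X, y_c)                           # weights: covariance of each predictor with y
w <- w / sqrt(sum(w^2))                          # normalise to unit length
t_scores <- X %*% w                              # latent scores
q <- sum(y_c * t_scores) / sum(t_scores^2)       # regress y on the scores

# Fitted response and class assignment (0.5 cut-off is an assumed rule)
y_hat      <- mean(y) + as.numeric(t_scores) * q
pred_class <- ifelse(y_hat > 0.5, "virginica", "versicolor")

table(observed = dat$Species, predicted = pred_class)
accuracy <- mean(pred_class == as.character(dat$Species))  # training accuracy
accuracy
```

Unlike PCA, the weights w are chosen using the class labels, which is what makes the method supervised. Further components would require deflating X and y and repeating the steps; in practice a dedicated package (for example pls or mixOmics) handles this.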

_______

ML WITH UNSUPERVISED ALGORITHMS

Hierarchical Clustering (example)

  1. Hierarchical clustering done by hand as in the Statology tutorial, but using the lecture dataset nmr_bins…csv

https://www.statology.org/hierarchical-clustering-in-r/

If there is no time, or this does not work out, the alternative is to let the students try MetaboAnalyst themselves during the practice sessions too, hoping that both the network and the platform hold up.
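
A minimal sketch of hierarchical clustering with base R functions only. It again uses the built-in USArrests data as a stand-in for the lecture file; Ward linkage and k = 4 clusters are assumptions made for illustration.

```r
# Stand-in data: built-in USArrests (assumption; the lecture uses nmr_bins…csv)
X <- scale(USArrests)                    # scale so no variable dominates the distances

# Step 1: pairwise Euclidean distances between observations
d <- dist(X, method = "euclidean")

# Step 2: agglomerative clustering (Ward linkage is an assumed choice)
hc <- hclust(d, method = "ward.D2")

# Step 3: dendrogram (drawn only in an interactive session)
if (interactive()) plot(hc, cex = 0.6, hang = -1)

# Step 4: cut the tree into k groups (k = 4 is an assumed choice)
clusters <- cutree(hc, k = 4)
table(clusters)

# Step 5: describe each cluster by the mean of the original variables
aggregate(USArrests, by = list(cluster = clusters), FUN = mean)
```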

_______

SAMPLE SIZE… 🙀 a.k.a. “the 1,000,000 $ question”!
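
As one hedged illustration of how the question can be approached, stats::power.t.test() solves for the sample size of a two-sample t-test once the other ingredients are fixed. The effect size (delta), standard deviation, significance level and power below are assumed values, not taken from any study discussed in the lecture.

```r
# Required n per group for a two-sample, two-sided t-test
# (delta, sd, sig.level and power are assumed values for illustration)
pw <- power.t.test(delta = 0.5, sd = 1,
                   sig.level = 0.05, power = 0.80,
                   type = "two.sample", alternative = "two.sided")
pw

# Round up: we cannot recruit a fraction of a participant
n_per_group <- ceiling(pw$n)
n_per_group

# How the required n changes with the size of the effect we hope to detect
deltas <- c(0.2, 0.5, 0.8)
n_by_delta <- sapply(deltas, function(d) {
  ceiling(power.t.test(delta = d, sd = 1, sig.level = 0.05, power = 0.80)$n)
})
data.frame(delta = deltas, n_per_group = n_by_delta)
```

Smaller effects require much larger samples, which is why the expected effect size is usually the hardest input to justify.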

_______

Final thoughts/recommendations

  • The analyses proposed in this Lab are very similar to the process we go through in real life. The following steps are always included:

    • Thorough understanding of the input data and the data collection process
    • Bivariate analysis of correlation / association to form an intuition of which explanatory variable(s) may or may not affect the response variable
    • Diagnostic plots to verify if the necessary assumptions are met for a linear model to be suitable
    • Once the assumptions are verified, we fit the hypothesized (linear) model to the data
    • Assessment of the model performance (\(R^2\), adjusted \(R^2\), \(F\)-statistic, etc.)
  • As we saw with hypothesis testing, the assumptions we make (and require) for regression are of the utmost importance

  • Clearly, we only scratched the surface of all the possible predictive models, but we got the hang of the fundamental steps and of some useful tools that may also serve us in more advanced analyses

    • e.g. broom (within tidymodels), performance, rstatix, lmtest
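
The steps listed above can be sketched in a few lines of base R. This is an illustration under stated assumptions: it uses the built-in mtcars data (mpg as response, wt as explanatory variable) rather than the NHANES sample, and a Shapiro-Wilk test on the residuals stands in for a fuller set of diagnostics.

```r
# Stand-in data: built-in mtcars (assumption; the lab uses the NHANES sample)
dat <- mtcars

# 1. Understand the input data
str(dat[, c("mpg", "wt")])

# 2. Bivariate association between response and explanatory variable
cor.test(dat$mpg, dat$wt)

# 3. Fit the hypothesized linear model
fit <- lm(mpg ~ wt, data = dat)

# 4. Check assumptions: diagnostic plots (interactive only) and residual normality
if (interactive()) {
  op <- par(mfrow = c(2, 2))
  plot(fit)
  par(op)
}
shapiro.test(residuals(fit))

# 5. Assess model performance
s <- summary(fit)
c(r_squared     = s$r.squared,
  adj_r_squared = s$adj.r.squared,
  f_statistic   = unname(s$fstatistic["value"]))
```

With the packages loaded earlier, broom::glance(fit) and performance::check_model(fit) offer tidier versions of steps 4 and 5.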